This course, “IODS – Introduction to Open Data Science”, is organized by the Doctoral School of Social Sciences and the Centre for Research Methods, University of Helsinki. The themes of the course are open data, open science and data science. The aim of the course is to understand the concepts of open data and open science and to learn to use some of the open tools available (= data science). Students learn to analyze and visualize data with sophisticated statistical methods using these open-access tools and become familiar with the concept of reproducible science.
Link to my GitHub repository: https://github.com/anninahaverinen/IODS-project
Annina Haverinen, 7.11.2018, IODS-Project, Exercise 2
Getting started: reading the data “learning2014” from local disk; the original data has already been wrangled.
The learning2014 data is a subset of data from a study investigating the relationship between learning approaches and student achievement. It consists of 7 variables: gender (F/M), age, exam points, global attitude towards statistics, and question scores on surface, deep and strategic learning. There are 166 observations in this dataset, representing 166 students.
learning2014<-read.table("/Users/Annina/Documents/GitHub/IODS-project/Data/learning2014.txt")
Looking at the dimensions, structure and first observations of the data.
dim(learning2014) # gives the dimensions of the dataset: 166 observations of 7 variables.
## [1] 166 7
str(learning2014) # gives the names of the variables, the type of each variable (factor with number of levels, numeric or integer) and the first observations
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ surf : num 2.64 3.09 2.18 2.27 2.82 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
Next a summary of the variables:
summary(learning2014)
## gender Age Points Attitude surf
## F:110 Min. :17.00 Min. : 7.00 Min. :14.00 Min. :1.545
## M: 56 1st Qu.:21.00 1st Qu.:19.00 1st Qu.:26.00 1st Qu.:2.455
## Median :22.00 Median :23.00 Median :32.00 Median :2.818
## Mean :25.51 Mean :22.72 Mean :31.43 Mean :2.783
## 3rd Qu.:27.00 3rd Qu.:27.75 3rd Qu.:37.00 3rd Qu.:3.091
## Max. :55.00 Max. :33.00 Max. :50.00 Max. :4.273
## deep stra
## Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.625
## Median :3.667 Median :3.188
## Mean :3.680 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :4.917 Max. :5.000
The summary gives descriptive information on the variables. For factor variables the number of observations per level is given, and for numeric variables the min and max values, 1st and 3rd quartiles, mean and median. From this we get an overview of the distribution of each variable. The summary shows that “Age” is skewed, while the other variables are quite normally distributed (for a roughly normal distribution the mean is close to the median, and the quartiles and min-max values are symmetric around it).
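The mean-vs-median heuristic used above can also be checked numerically. A small sketch (gender, a factor, is dropped):

```r
# mean minus median per numeric variable: values far from zero suggest skew
sapply(learning2014[-1], function(x) mean(x) - median(x))
```

For “Age” the gap (about 3.5 years) stands out clearly against the near-zero gaps of the other variables.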
To get an idea of the possible relationships between the variables in the data, we plot them in a plot matrix.
library(ggplot2) #installed for graphics
library(GGally) #installed for graphics
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20))) # plot matrix of the data
The plot matrix gives us scatter plots of all possible pairs of variables in the data frame. The matrix also gives correlation coefficients for the pairs and density curves for estimating the distributions. Visually, the distributions are more or less normal for all variables except “Age”. Below, a correlation matrix.
p2<- ggcorr(learning2014 [-1]) # plots a correlation matrix, helps to visualize the correlated variables in the data set
p2
From these plots we can see that the “Attitude”-“Points” pair has the highest correlation coefficient. “Attitude” and “stra” are positively correlated with “Points”, and “deep” is slightly negatively correlated with “Points”.
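The visual reading of the correlation matrix can be confirmed by printing the correlations with “Points” directly. A sketch, assuming learning2014 is loaded as above:

```r
# correlations of the numeric variables with exam points, strongest first
# (column 1, gender, is a factor and is dropped)
sort(cor(learning2014[-1])[, "Points"], decreasing = TRUE)
```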
In the first multiple regression model (named LM1) I therefore include “Attitude”, “stra” and “deep” (3 variables are required in the exercise) as explanatory variables for “Points”. In the next model (LM2) I include only “Attitude” and “stra”, these two being positively correlated with “Points”. The last model (LM3) is a simple linear regression model with only one explanatory variable, “Attitude”.
LM1<-lm(Points~Attitude+stra+deep, data=learning2014)
LM1
##
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = learning2014)
##
## Coefficients:
## (Intercept) Attitude stra deep
## 11.3915 0.3525 0.9621 -0.7492
summary(LM1) #metrics indicating the quality of the model fit
##
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.39145 3.40775 3.343 0.00103 **
## Attitude 0.35254 0.05683 6.203 4.44e-09 ***
## stra 0.96208 0.53668 1.793 0.07489 .
## deep -0.74920 0.75066 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
confint(LM1) # confidence intervals of the coefficient estimates; from these we can also tell whether the coefficient estimates are significant in our model.
## 2.5 % 97.5 %
## (Intercept) 4.66211614 18.1207860
## Attitude 0.24030902 0.4647614
## stra -0.09769802 2.0218634
## deep -2.23153872 0.7331395
Interpretation of the multiple linear regression model LM1: “Attitude”, “stra” and “deep” as explanatory variables for “Points”.
Residuals (= how far the observations are from the regression line) show how well the data fit the model. The residual median should be close to the mean (which is 0 by construction), and the residuals should be normally distributed.
The coefficient estimates are the slope values estimated by the regression. Significance levels for the coefficient estimates are indicated with symbols: the estimate is highly significant for “Attitude” (***, p < 0.001) and only marginally significant for “stra” (“.”, p < 0.1). The coefficient estimate for “deep” is not significant in this model. High significance means it is unlikely that “Attitude” is uncorrelated with “Points”.
The standard error of a coefficient estimate describes the uncertainty of the estimate; it should be small relative to the estimate itself.
The t value and its p value show that the “Attitude” coefficient estimate contributes significantly to the model; the other coefficients are non-significant.
The residual standard error is a measure of the variability of the residuals: the standard deviation of the residuals, which tells how well the data fit the model. For normally distributed residuals the 1st and 3rd residual quartiles are roughly two thirds of the residual standard error; here the quartiles (about -3.4 and 3.8) are well in proportion to the residual standard error (5.29).
Evaluation of the goodness of fit of the model: the adjusted R-squared is 19.5% (adjusted is used for multiple regression; the unadjusted value can be used in simple regression). This means that 19.5% of the variation in “Points” can be explained by this model with the variables “Attitude”, “stra” and “deep”. A high R-squared indicates a strong relationship between the model and the response, a low R-squared a weak one.
F test of the overall significance of the model: the p value of the F statistic determines whether the R-squared relationship is statistically significant. In this model the F-test p value (2.5e-08) is highly significant, although larger than the t-test p value for “Attitude”.
Based on this output I would exclude “deep” from the multiple regression model and simplify the model to only two explanatory variables.
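The decision to drop “deep” can also be tested formally with a nested-model F test; a sketch (the reduced model is given an illustrative name here to avoid clashing with LM2 below):

```r
# compare the full model LM1 with a reduced model without "deep";
# a non-significant p value means "deep" adds little explanatory power
LM1_reduced <- lm(Points ~ Attitude + stra, data = learning2014)
anova(LM1_reduced, LM1)
```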
LM2: “Attitude” and “stra” as explanatory variables for “Points”
LM2<-lm(Points~Attitude+stra, data=learning2014)
LM2
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) Attitude stra
## 8.9729 0.3466 0.9137
summary(LM2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
confint(LM2) # gives confidence intervals of the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 4.2418686 13.7039310
## Attitude 0.2349813 0.4581806
## stra -0.1417267 1.9690301
LM2: the adjusted R-squared is the same, 19.5%, but the F statistic is higher (21.0), with a smaller p value, indicating a better fit than LM1.
LM3: next, a simple regression model with only one explanatory variable, “Attitude”, for “Points”
LM3<-lm(Points~Attitude,data=learning2014)
LM3
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Coefficients:
## (Intercept) Attitude
## 11.6372 0.3525
summary(LM3)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
confint(LM3) # gives confidence intervals of the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 8.0230738 15.2512331
## Attitude 0.2405135 0.4645785
LM3: the adjusted R-squared is essentially unaffected (18.6%), but the F statistic is 38.6 with a very small p value (4.1e-09). I would conclude that this simple model is preferable to those with multiple explanatory variables.
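The three models can also be compared with an information criterion, where a lower value indicates a better trade-off between fit and complexity. A sketch, assuming LM1, LM2 and LM3 are fitted as above:

```r
# Akaike information criterion for the three candidate models
AIC(LM1, LM2, LM3)
```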
Below, scatterplots with regression lines (the 95% CI shown in light grey around the line) for the LM2 and LM3 models. Note that for LM2 the x-axis is the sum Attitude + stra, so the plotted line is only an approximation of the two-variable fit.
qplot(Attitude+stra, Points, data = learning2014) + geom_smooth(method = "lm")
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm")
Model validation: below, the regression models LM2 and LM3 are subjected to diagnostic plots.
plot(LM2, which = c(1,2,5))
plot(LM3, which= c(1,2,5))
The diagnostic plots are used to see how well our data fit the model. 1. Residuals vs fitted values: shows the variance of the residuals. In this case the fit is quite good; there is no pattern in the scatter plot. 2. Normal Q-Q plot: shows whether the residuals from our model are normally distributed, which is an assumption of the model. Here some deviation from normality can be seen at the extremes, but I interpret the data as fitting the model well enough. 3. Residuals vs leverage: helps us find influential observations that could have an impact on the regression line (not all outliers are influential, i.e. determine the regression line). Here no cases fall outside Cook’s distance, i.e. there are no influential extremes in the data.
The interpretation is that the data fit the regression models (both the two-variable multiple regression and the simple regression model) and that the models are validated for the data.
This exercise is based on a dataset about student achievement in secondary education in two Portuguese schools. The data include student grades, demographic, social and school-related features, and were collected using school reports and questionnaires. The data “alc” is a subset of these data, with alcohol use on weekdays and weekends combined by averaging and high use defined as > 2 (scale: 1 = very low to 5 = very high). Links to original dataset and publication.
Reading the wrangled data from local disk.
alc<-read.table("/Users/Annina/Documents/GitHub/IODS-project/Data/alc.txt",sep=";")
Next, let’s explore the data:
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset is composed of 35 variables and 382 observations (i.e. 382 students).
summary(alc$high_use) # = alc_use >2 = TRUE
## Mode FALSE TRUE
## logical 268 114
The response variable of interest is the binary variable “high_use” (TRUE/FALSE), corresponding to alcohol use > 2 on a scale from 1 to 5.
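For reference, the wrangling step behind “high_use” described above can be sketched as follows (illustrative only; the actual step was done in the separate wrangling script):

```r
# average weekday (Dalc) and weekend (Walc) alcohol use,
# then flag averages above 2 as high use
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
alc$high_use <- alc$alc_use > 2
```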
My hypothesis is that high alcohol use is correlated with the variables “goout”, “sex” and “absences”, and that these could be used in a model for predicting high use. Specifically: male students drink more than female students, people who go out more drink more alcohol, and people who have more absences from school drink more alcohol.
Let’s look at the chosen variables:
p1 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("Going out") + xlab("Alcohol high use") + ggtitle("Student going out by alcohol use and sex")
p1
In this boxplot we can see that male students go out more than female students, and that going out more seems positively related to high alcohol use.
p2 <- ggplot(alc, aes(high_use, absences, col=sex))+ylab("Absences")+xlab("Alcohol high use")+geom_boxplot()+ggtitle("Alcohol use by student absences and sex")
p2 #numeric 0-93 days
In this boxplot it looks like the students with high alcohol use have somewhat more absences.
table(alc$high_use,alc$sex)
##
## F M
## FALSE 156 112
## TRUE 42 72
From this table we see that 38% of the male students and 21% of the female students fall in the high alcohol use group. Of the whole data, 30% are in the high-use group.
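The percentages quoted here come from the column proportions of the cross-tabulation; a sketch:

```r
# share of high users within each sex (each column sums to 1)
prop.table(table(alc$high_use, alc$sex), margin = 2)
# overall share of students in the high-use group
mean(alc$high_use)
```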
p3 <- ggplot(alc, aes(high_use, col = sex)) + geom_bar() + ylab("Students") + xlab("Alcohol high use") + ggtitle("Alcohol high use by sex")
p3
In this barplot we see that more men than women are in the high-use group.
model1 <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
coef(model1)
## (Intercept) absences sexM goout
## -4.16317087 0.08418399 0.95871896 0.72980859
ORmodel1<-coef(model1)%>%exp
CImodel1<-confint(model1)%>%exp
## Waiting for profiling to be done...
ORCImodel1<-cbind(ORmodel1,CImodel1)
ORCImodel1
## ORmodel1 2.5 % 97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences 1.08782902 1.042458467 1.13933894
## sexM 2.60835292 1.593132148 4.33151387
## goout 2.07468346 1.650182481 2.64111050
The residuals are more or less normally distributed and the median is close to 0. The coefficients of the variables all show statistical significance in the model, meaning these are significantly predictive variables of high alcohol use.
Male sex is associated with 2.6 times the odds of high alcohol use compared with female sex. “goout” is associated with 2.07 times the odds of high alcohol use per one-unit increase in “goout”, and “absences” with 1.09 times the odds per additional absence.
probabilities <- predict(model1, type="response")
alc<-mutate(alc,probability=probabilities)
alc<-mutate(alc, prediction =probability>0.5)
dim(alc)
## [1] 382 37
select(alc, sex,goout,absences,probability,prediction)%>%tail(10)
## sex goout absences probability prediction
## 373 M 2 0 0.14869987 FALSE
## 374 M 3 7 0.39514446 FALSE
## 375 F 3 1 0.13129452 FALSE
## 376 F 3 6 0.18714923 FALSE
## 377 F 2 2 0.07342805 FALSE
## 378 F 4 2 0.25434555 FALSE
## 379 F 2 2 0.07342805 FALSE
## 380 F 1 3 0.03989428 FALSE
## 381 M 5 4 0.68596604 TRUE
## 382 M 1 2 0.09060457 FALSE
table(high_use=alc$high_use,prediction=alc$prediction) # confusion matrix: 302 predictions right out of 382 observations
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
From this confusion matrix we can calculate that the model gets 302 of 382 predictions right: 79% right, 21% wrong.
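The accuracy can be computed directly from the confusion matrix instead of by hand; a sketch, assuming the prediction column created above:

```r
# overall accuracy: correctly classified observations / all observations
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
sum(diag(conf)) / sum(conf)
```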
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) + geom_point()
g
This plot shows the same thing: the observations plotted by probability against high_use, coloured by prediction. From it we can see that the model predicts low alcohol use better than high alcohol use.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
Loss function: the share of wrongly classified observations with this model
loss_func<-function(class,prob){
n_wrong <-abs(class-prob)>0.5
mean(n_wrong)
}
#mean proportion of wrongly classified observations = mean prediction error; 20.9% wrong predictions with this model on the training data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
20.9% of the observations are classified wrong by this model’s predictions. Always guessing FALSE (prob = 0) produces 29.8% wrong predictions, so this model is better than simply guessing.
library(boot)
cv <-cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10)
cv$delta[1]
## [1] 0.2172775
22% is the mean prediction error of the model in 10-fold cross-validation.
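The same 10-fold cross-validation can be repeated for a simpler model to see whether dropping a predictor hurts out-of-sample performance; a sketch (the reduced model name is illustrative):

```r
library(boot)
# logistic model without "absences", cross-validated with the same loss function
model_reduced <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
cv_reduced <- cv.glm(data = alc, cost = loss_func, glmfit = model_reduced, K = 10)
cv_reduced$delta[1]  # mean prediction error of the reduced model
```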
Annina Haverinen 19.11.2018
Loading the “Boston” dataset from the MASS package, included with R. Link to dataset
This is a data set about Housing Values in Suburbs of Boston.
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81-102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley
Variables in the dataset:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
dim(Boston) #506 observation of 14 variables
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The data consist of 506 observations of 14 variables, whose names are listed above.
gather(Boston) %>%
ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Visualisation of the variables: “rm” is the only one that looks normally distributed; the others seem rather skewed.
library(tidyr)
library("ggplot2")
library("GGally")
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix
p1
p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)
p2
Looking at the correlation plot, we can see that the variables positively correlated with crim (our variable of interest, the per-capita crime rate by town) are rad (index of accessibility to radial highways), lstat (percentage of lower-status population) and tax (full-value property-tax rate per $10,000). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).
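The visual reading can be confirmed by sorting the correlations with “crim” directly; this block is self-contained apart from the MASS package:

```r
library(MASS)
data("Boston")
# correlations of all variables with the per-capita crime rate, strongest first
sort(cor(Boston)[, "crim"], decreasing = TRUE)
```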
scaled_boston <- scale(Boston)
summary(scaled_boston)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
crim_vector <- quantile(scaled_boston$crim) #new crime variable
crim_vector
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8)
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404 14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102 14
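Because sample() draws a random split, the training and test sets (and all results below) change from run to run; setting a seed first would make the analysis reproducible. An illustrative sketch, not part of the original run:

```r
set.seed(123)  # any fixed seed; makes the 80/20 split reproducible
train <- sample(nrow(scaled_boston), size = nrow(scaled_boston) * 0.8)
```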
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2549505 0.2574257 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.0528366 -0.9749369 -0.11163110 -0.9020874 0.51845273
## med_low -0.1059251 -0.3396443 -0.04298342 -0.5648400 -0.13206300
## med_high -0.3948894 0.1848484 0.18195173 0.3793551 -0.02154002
## high -0.4872402 1.0171737 -0.03371693 1.0707206 -0.39658083
## age dis rad tax ptratio
## low -0.9156278 0.9727477 -0.6912387 -0.7567062 -0.46493272
## med_low -0.3321160 0.3056852 -0.5436697 -0.4821863 -0.01802759
## med_high 0.4130206 -0.3554782 -0.4761040 -0.3510220 -0.21485482
## high 0.8092204 -0.8485658 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.38257007 -0.79565744 0.60421080
## med_low 0.31804177 -0.16601039 0.01817895
## med_high 0.06424352 0.08025286 0.06187999
## high -0.74252566 0.91278445 -0.66681928
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06913123 0.62253739 -0.87642543
## indus 0.12303584 -0.30386020 0.17547140
## chas -0.04515286 -0.04074817 0.06291740
## nox 0.31124788 -0.66912601 -1.48795046
## rm 0.04157514 -0.01563936 -0.25867848
## age 0.13352783 -0.31042971 -0.18910935
## dis -0.08030070 -0.14383989 -0.11632934
## rad 3.94011359 0.97891073 -0.07836587
## tax 0.10680198 0.06264412 0.69649928
## ptratio 0.12930532 -0.06535272 -0.27320048
## black -0.05030945 0.05298142 0.13157895
## lstat 0.21240889 -0.20114242 0.14195983
## medv 0.07239768 -0.30023249 -0.24684186
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9615 0.0296 0.0090
classes <- as.numeric(train_set$crime) # 1, 2, 3, 4 instead of low, med_low, med_high, high
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)
##Predict with LDA
# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(correct = lda.pred$class, prediction = correct_classes)%>%addmargins()
## prediction
## correct low med_low med_high high Sum
## low 15 3 0 0 18
## med_low 12 15 4 0 31
## med_high 2 5 14 0 21
## high 0 0 4 28 32
## Sum 29 23 22 28 102
The prediction was right for 72 of the 102 test observations, about 71% (the diagonal of the cross-tabulation divided by the total). Note that the labels in the table call above are swapped: the rows are the LDA predictions and the columns the correct classes. The high-crime class was predicted best (28/28 correct), while med_low was most often confused with low.
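The overall test-set accuracy can be computed from the cross-tabulation; a sketch (with the label names corrected to row = correct class, column = prediction):

```r
# proportion of test observations classified correctly by the LDA model
cross <- table(correct = correct_classes, prediction = lda.pred$class)
sum(diag(cross)) / sum(cross)
```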
library(MASS)
data("Boston")
scaled_boston <- scale(Boston)
#determining the number of clusters
wss <- (nrow(scaled_boston) - 1) * sum(apply(scaled_boston, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
km<- kmeans(scaled_boston,3)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
## Length Class Mode
## cluster 506 -none- numeric
## centers 70 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
Based on the elbow in the within-groups sum of squares plot, 3 clusters would be appropriate.
pairs(scaled_boston, col=km$cluster)
data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506 14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.3616601 0.2608696 0.2173913 0.1600791
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3909812 -0.2256428 -0.6123014 -0.07870119 -0.5097893 0.3443714
## 2 1.0632703 -0.4872402 1.0149946 -0.03371693 1.0159127 -0.3735788
## 3 -0.3215538 -0.4823678 0.6058983 0.19296460 0.4710723 -0.5944241
## 4 -0.4127310 1.9588740 -1.0935425 -0.02929819 -1.1435430 0.6380135
## age dis rad tax ptratio black
## 1 -0.3398029 0.2457676 -0.5676700 -0.7140001 -0.29244580 0.36383532
## 2 0.7542188 -0.8233749 1.6596029 1.5294129 0.80577843 -0.75124560
## 3 0.6869197 -0.5438170 -0.5851280 -0.2061098 0.06253067 0.03661229
## 4 -1.3942484 1.5250604 -0.6274061 -0.5993631 -0.73732772 0.35253338
## lstat medv
## 1 -0.5489002 0.4511972
## 2 0.8328654 -0.6664074
## 3 0.5963120 -0.4822750
## 4 -0.9269606 0.7215673
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.0802880132 -0.188233567 -0.12940945
## zn 0.2778536432 -1.307489935 1.28053403
## indus 0.2885719633 0.232092639 0.66603797
## chas -0.0233666154 -0.003197753 0.22741150
## nox 0.0001974368 0.093699623 0.35927729
## rm -0.0538094053 -0.188523052 -0.15839928
## age 0.0383469023 0.635299039 0.17994768
## dis -0.2692110931 -0.530377962 -0.01103552
## rad 5.9895792739 -1.226336654 -1.48557442
## tax 0.2063385825 -0.171020064 0.83656786
## ptratio 0.2700183626 0.161060370 0.31935417
## black -0.0385016887 -0.022900357 -0.08539887
## lstat 0.0616549825 0.182880501 0.40175464
## medv -0.0752439777 -0.077267369 -0.01891136
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8947 0.0879 0.0174
classes1 <- as.numeric(km_bonus$cluster)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)
##superbonus
model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda1$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)